xarray zarr - Python for climatology, oceanography and atmospheric science

xarray zarr

Reading

Instead of waiting for full datasets, process data incrementally as it arrives (e.g., from simulations).

code:python

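# Sketch: process() stands for your own function
import xarray as xr

ds = xr.open_zarr("simulation.zarr", chunks={"time": 1})
# Iterating a Dataset yields variable names, so step along time explicitly
for t in range(ds.sizes["time"]):
    process(ds.isel(time=t))  # e.g., accumulate statistics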
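As a minimal sketch of what "accumulate statistics" could mean, the accumulator below keeps a running mean and variance (Welford's method) without holding earlier time steps in memory. The class name and the sample values are hypothetical; in practice each value might be a per-step scalar such as a domain-mean sea level.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford accumulator: mean and variance without storing past values."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two values
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")


# Hypothetical per-step scalars, e.g. float(ds["zos"].isel(time=t).mean())
stats = RunningStats()
for step_value in [0.10, 0.12, 0.08, 0.11]:
    stats.update(step_value)
```

The same update rule works element-wise on arrays, so it can be extended to per-grid-point statistics if memory allows.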

Treat multi-run, multi-model archives as a tree: DataTree + Zarr v3

If you work with ensembles (different model configurations, perturbations, parameter sweeps), representing them as a hierarchy can be cleaner than forcing everything into one Dataset. Recent xarray releases include DataTree in the core package (xr.DataTree, xr.open_datatree), and their release notes mention reading Zarr v3 stores into a DataTree.

code:python

import xarray as xr

# Example idea: keep each experiment under a node (pseudo-layout)

# /control, /perturbed, /highres ... each a dataset

dt = xr.open_datatree("experiments.zarr", engine="zarr") # layout-dependent

# Compare two runs cleanly

delta_ssh = dt["highres"].ds["zos"].mean("time") - dt["control"].ds["zos"].mean("time")

Why it matters in oceanography: it keeps metadata + provenance for runs together while still letting you compute “differences of products” naturally.

IO parallelism tuning beyond chunking (file system + scheduler interplay)

Two underused levers:

File-level parallelism: whether many small Zarr chunks or fewer large ones perform better depends on whether the data sits in an object store or on a POSIX filesystem.

Scheduler choice: threaded vs processes vs distributed.

Pattern:

code:python

import dask
import xarray as xr

dask.config.set(scheduler="threads")  # threads usually suit I/O-bound reads

ds = xr.open_zarr("ocean.zarr", chunks={})  # chunks={} keeps the store's on-disk chunking

Rules of thumb (benchmark on your own storage before committing to a layout):

object storage → more, smaller chunks (parallel GETs)

HPC filesystem → fewer, larger chunks (reduce metadata ops)
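To make the trade-off concrete, it can help to compute the uncompressed chunk size and the chunk count for a candidate layout before writing anything. The helper below is a sketch with a hypothetical name and a made-up array shape; it only does arithmetic and does not touch a store.

```python
import math


def chunk_summary(shape, chunks, itemsize):
    """Return (uncompressed bytes per chunk, number of chunks) for one array."""
    chunk_bytes = math.prod(chunks) * itemsize
    n_chunks = math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))
    return chunk_bytes, n_chunks


# Hypothetical float32 field with dims (time, y, x): one year of hourly data
shape = (8760, 2048, 2048)
small = chunk_summary(shape, chunks=(24, 256, 256), itemsize=4)
large = chunk_summary(shape, chunks=(24, 1024, 1024), itemsize=4)
print(f"small: {small[0] / 2**20:.0f} MiB x {small[1]} chunks")
print(f"large: {large[0] / 2**20:.0f} MiB x {large[1]} chunks")
```

The chunk count is also the number of objects or files per variable, which is what drives metadata load on a shared filesystem.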

Object-store friendly I/O: read only what you need via fsspec, and align access windows with chunk layout

For Zarr or remote datasets, I/O performance is dominated by how your slice pattern matches chunking. If you typically read “time windows + spatial tiles,” design chunking (and your selections) accordingly.

code:python

import xarray as xr

ds = xr.open_zarr(
    "s3://bucket/model.zarr",
    chunks={"time": 24, "y": 1024, "x": 1024},
    consolidated=True,
    storage_options={"anon": False},  # adjust to your auth
)

# Example: time window + spatial tile (often best when aligned to chunk boundaries)
sub = (
    ds["ssh"]
    .sel(time=slice("2015-01-01", "2015-01-31"))
    .isel(y=slice(0, 2048), x=slice(0, 2048))
)
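One way to see why alignment matters is to count how many chunks a selection overlaps, since every overlapped chunk is fetched and decompressed in full. The functions below are an illustrative sketch (hypothetical names, index ranges given as half-open start/stop pairs), using the chunk sizes from the example above.

```python
def chunks_touched(start, stop, chunk):
    """Number of chunks along one dimension that the range [start, stop) overlaps."""
    if stop <= start:
        return 0
    first = start // chunk
    last = (stop - 1) // chunk
    return last - first + 1


def selection_cost(selection, chunks):
    """Total chunks fetched for a selection.

    selection: {dim: (start, stop)}; chunks: {dim: chunk length}.
    """
    total = 1
    for dim, (start, stop) in selection.items():
        total *= chunks_touched(start, stop, chunks[dim])
    return total


chunks = {"time": 24, "y": 1024, "x": 1024}
aligned = selection_cost({"time": (0, 24), "y": (0, 2048), "x": (0, 2048)}, chunks)
shifted = selection_cost({"time": (12, 36), "y": (512, 2560), "x": (512, 2560)}, chunks)
print(aligned, shifted)
```

Both selections cover the same number of grid points, but the shifted one overlaps several times as many chunks.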

hr.icon

Writing

Chunk-shape “tiling” for multi-variable workflows

When multiple variables are used together (e.g., u, v, T, S), misaligned chunks cause repeated rechunking.

Solution: enforce a shared chunk template at write time.

code:python

# Assumes every data variable has the same three dimensions, e.g. (time, y, x)
encoding = {var: {"chunks": (24, 512, 512)} for var in ds.data_vars}
ds.to_zarr("ocean_aligned.zarr", encoding=encoding)

This avoids hidden rechunk costs during multi-variable operations like fluxes or budgets.
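When variables do not all share the same dimensions (for example 4-D velocities next to a 3-D sea surface height), a single chunk tuple no longer fits every variable. A possible sketch is to keep the template per dimension and derive each variable's tuple from it. The function name, variable names, and chunk lengths below are hypothetical.

```python
def build_encoding(var_dims, template):
    """Map each variable to a chunk tuple ordered like its own dimensions.

    var_dims: {variable name: tuple of dimension names}
    template: {dimension name: chunk length}
    """
    encoding = {}
    for name, dims in var_dims.items():
        missing = [d for d in dims if d not in template]
        if missing:
            raise KeyError(f"{name}: no chunk length for {missing}")
        encoding[name] = {"chunks": tuple(template[d] for d in dims)}
    return encoding


# Hypothetical layout; with a real Dataset: {v: ds[v].dims for v in ds.data_vars}
var_dims = {
    "uo": ("time", "depth", "y", "x"),
    "thetao": ("time", "depth", "y", "x"),
    "zos": ("time", "y", "x"),
}
template = {"time": 24, "depth": 10, "y": 512, "x": 512}
encoding = build_encoding(var_dims, template)
```

The resulting dictionary has the same shape as the `encoding` argument used above, so shared dimensions end up with identical chunk lengths across variables.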

Write-time compression tuning (Blosc/Zstd tradeoffs)

Compression is not only about size: it also affects read speed.

code:python

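from numcodecs import Blosc  # zarr.Blosc in zarr-python 2 is this same class

# Zarr v2-style encoding; Zarr v3 stores use a "compressors" entry with zarr.codecs instead
encoding = {
    "thetao": {
        "compressor": Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE)
    }
}
ds.to_zarr("optimized.zarr", encoding=encoding)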

Typical pattern:

moderate compression (clevel ~3–5) → best throughput

very high compression → slower reads, often not worth it
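The pattern is easy to check on your own data by measuring compression ratio and decompression time at a few levels. The sketch below uses the standard library's zlib as a stand-in for Blosc/Zstd so that it runs without extra packages; the synthetic values and the chosen levels are assumptions, and the numbers will not transfer to other codecs or to real fields.

```python
import struct
import time
import zlib

# Synthetic, smoothly varying float32 values; real ocean fields will behave differently
values = [20.0 + 0.001 * (i % 5000) for i in range(200_000)]
raw = struct.pack(f"{len(values)}f", *values)

results = {}
for level in (1, 6, 9):
    packed = zlib.compress(raw, level)
    t0 = time.perf_counter()
    restored = zlib.decompress(packed)
    read_seconds = time.perf_counter() - t0
    assert restored == raw  # compression is lossless
    results[level] = {"ratio": len(raw) / len(packed), "read_seconds": read_seconds}
    print(f"level {level}: ratio {results[level]['ratio']:.1f}, "
          f"decompress {read_seconds * 1e3:.2f} ms")
```

For a real decision, repeat the measurement with the codec you plan to use and with one representative chunk of your own data.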

Use DataTree for multi-resolution / multi-product ocean archives

A useful emerging pattern is to represent related ocean products as a hierarchy rather than as many loosely connected Datasets: e.g., /raw_sst, /daily_sst, /fronts, /eddies, /climatology, each with its own grid and metadata. Xarray now has first-class DataTree support, and DataTree.chunk() can rechunk arrays across groups; DataTree.to_zarr() / open_datatree() make this natural for hierarchical Zarr stores. This is especially attractive for model–observation matchup archives, nested models, SWOT swath + gridded products, or glider profiles grouped by deployment.

code:python

import xarray as xr

tree = xr.DataTree.from_dict({
    "/model/hourly": model_ds,
    "/obs/argo": argo_ds,
    "/diagnostics/mld": mld_ds,
})

tree = tree.chunk({"time": 30, "lat": 256, "lon": 256})

tree.to_zarr("ocean_matchup_hierarchy.zarr", mode="w")

A practical analytical workflow is to keep raw and derived diagnostics in the same store, but isolate their coordinates and chunking by group. That avoids forcing, say, profile data and gridded SSH onto a single artificial schema.

hr.icon

Appending/Update

Operational Zarr workflows: append and partial updates (append_dim / region)

Instead of rewriting entire archives, design for incremental updates (daily runs, patches, reprocessed tiles). This can drastically reduce I/O.

code:python

# Initial write

ds0.to_zarr("ssh.zarr", mode="w", consolidated=True)

# Daily append along time

ds_new.to_zarr("ssh.zarr", mode="a", append_dim="time", consolidated=True)

# Patch a subregion (overwrite a spatial tile for a specific time slice)

region = {"time": slice(ti, ti+1), "y": slice(2000, 2600), "x": slice(3000, 3600)}

ds_patch.to_zarr("ssh.zarr", mode="r+", region=region)

Tip: align patch regions with Zarr chunk boundaries whenever possible; misalignment can cause extra reads/writes.
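A small check before each patch can catch misaligned regions early. The helpers below are a sketch with hypothetical names; the chunk length of 600 and the grid size are made up for illustration and should be replaced with the values of your own store.

```python
def is_aligned(start, stop, chunk, dim_size):
    """True if [start, stop) starts on a chunk boundary and ends on one or at the array edge."""
    return start % chunk == 0 and (stop % chunk == 0 or stop == dim_size)


def snap_outward(start, stop, chunk, dim_size):
    """Expand [start, stop) to the nearest enclosing chunk boundaries."""
    new_start = (start // chunk) * chunk
    new_stop = min(-(-stop // chunk) * chunk, dim_size)  # ceiling division
    return new_start, new_stop


# Hypothetical store: y and x chunks of 600 on a 4800 x 7200 grid
y_ok = is_aligned(2000, 2600, 600, 4800)
x_ok = is_aligned(3000, 3600, 600, 7200)
y_snapped = snap_outward(2000, 2600, 600, 4800)
print(y_ok, x_ok, y_snapped)
```

If a region is snapped outward, the patch dataset has to cover the enlarged region as well, for example by reading the existing values for the extra rows first.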

Use region writes for streaming model output or daily satellite updates

For operational or semi-operational ocean workflows, append-like writes can be slow and fragile. Xarray’s Dataset.to_zarr(region=...) supports writing into pre-existing Zarr arrays, but the documentation warns that region boundaries, Zarr chunks, and Dask chunks must align; otherwise incomplete chunk writes can corrupt data.

code:python

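import pandas as pd
import xarray as xr

# 2026 has 365 days; ds_day0 is assumed to hold one (lat, lon) field with no time dimension
times = pd.date_range("2026-01-01", periods=365)
template = xr.zeros_like(ds_day0).expand_dims(time=times).chunk({"time": 1})

# compute=False writes metadata and coordinates only; the lazy zeros are never computed
template.to_zarr(
    "sst_daily_2026.zarr",
    mode="w",
    compute=False,
    encoding={"sst": {"chunks": (1, 720, 1440)}},
)

# Later: write one day at a time, aligned to one time chunk.
# Variables without a time dimension (here lat/lon) must be dropped for a region write.
day = daily_ds.drop_vars(["lat", "lon"], errors="ignore")
day.chunk({"time": 1, "lat": 720, "lon": 1440}).to_zarr(
    "sst_daily_2026.zarr",
    region={"time": slice(day_index, day_index + 1)},
)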

This is useful for building rolling marine heatwave, SST-front, sea-ice-edge, or altimetry anomaly archives without rewriting the full store.
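The region write needs an integer position along time, so a daily job has to turn a calendar date into that index. A minimal sketch, assuming the store starts on 2026-01-01 and holds one step per day (the constant and function names are hypothetical):

```python
from datetime import date

STORE_START = date(2026, 1, 1)  # first time step in the store (assumed)
N_DAYS = 365                    # 2026 is not a leap year


def region_for(day):
    """Return the region dict selecting the single time step for `day`."""
    day_index = (day - STORE_START).days
    if not 0 <= day_index < N_DAYS:
        raise ValueError(f"{day} is outside the store's time axis")
    return {"time": slice(day_index, day_index + 1)}


region = region_for(date(2026, 3, 1))
print(region)
```

The returned dictionary can be passed as the `region` argument of `to_zarr`; the range check guards against silently writing outside the preallocated time axis.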

#zarr

#xarray